2022-05-07

Introduction

Learning to use the tidyverse for data exploration and modelling.

Materials

NHANES glycohemoglobin data

National Health and Nutrition Examination Survey (NHANES) data on glycohemoglobin levels and diabetes mellitus (DM), from https://hbiostat.org/data/.

Why this dataset?

  • Manageable size: 20 variables, 6795 observations
  • Wide spectrum of variables
  • Contains missing values to handle
  • Allows exploring correlations between DM diagnosis and the other variables

The data

Variable  Description                      Units   Levels / Range
seqn      Unique patient ID
sex       Gender                                   0, 1
age       Age                              Years   12 - 80
re        Race/ethnicity                           5 levels: White, Black, Mexican, Other Hispanic, Other
income    Family income level              $       14 levels from 0 - 100000
tx        On insulin or diabetes meds              0, 1
dx        Diagnosed with DM or pre-DM              0, 1
wt        Weight                           kg      28 - 239.4
ht        Height                           cm      123.3 - 202.7
bmi       Body-mass index                  kg/m^2  13.18 - 84.87
leg       Upper leg length                 cm      20.4 - 50.6
arml      Upper arm length                 cm      24.8 - 47
armc      Arm circumference                cm      16.8 - 61
waist     Waist circumference              cm      52 - 179
tri       Triceps skinfold thickness       mm      2.6 - 41.1
sub       Subscapular skinfold thickness   mm      3.8 - 40.4
gh        Glycohemoglobin                  %       4 - 16.4
albumin   Albumin                          g/dL    2.5 - 5.3
bun       Blood urea nitrogen              mg/dL   1 - 90
SCr       Serum creatinine                 mg/dL   0.14 - 15.66

Variable types

The dx variable does not differentiate between type 1 and type 2 diabetes.

Variables containing NAs

Methods

Data journey

Data cleaning - Imputation of NAs

Variable  Description                      Units   Levels / Range
income    Family income level              $       14 levels from 0 - 100000

Missing income values were replaced with the mean of all non-NA income values.
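The mean imputation described above can be sketched as follows. This is a minimal illustration in plain Python, not the project's tidyverse code; the function name `impute_mean` and the income values are hypothetical, and `None` stands in for NA.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical income values; None marks a missing observation.
income = [20000, None, 50000, 35000, None]
imputed = impute_mean(income)
print(imputed)
```

Note that income has 14 discrete levels, so the imputed mean will generally not coincide with one of the original levels.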

Variable  Description                      Units   Levels / Range
leg       Upper leg length                 cm      20.4 - 50.6
arml      Upper arm length                 cm      24.8 - 47
armc      Arm circumference                cm      16.8 - 61
waist     Waist circumference              cm      52 - 179
tri       Triceps skinfold thickness       mm      2.6 - 41.1
sub       Subscapular skinfold thickness   mm      3.8 - 40.4

Missing values in these variables were imputed with k-nearest neighbours (KNN, K = 5), implemented in tidyverse. K was not optimized.
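One way KNN imputation with K = 5 could work is sketched below in plain Python. This is an illustration, not the project's tidyverse implementation: the choice of Euclidean distance, the use of `ht` and `wt` as predictor features, the averaging of neighbour values, and the example rows are all assumptions.

```python
import math


def knn_impute(rows, target, features, k=5):
    """Fill missing `target` values with the mean over the k nearest complete rows.

    rows: list of dicts, with None marking a missing value.
    Distance is Euclidean over `features`, assumed complete in every row.
    Returns new dicts; the input rows are not modified.
    """
    donors = [r for r in rows if r[target] is not None]
    if len(donors) < k:
        raise ValueError("fewer complete rows than k")
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        point = [row[f] for f in features]
        nearest = sorted(
            donors, key=lambda d: math.dist(point, [d[f] for f in features])
        )[:k]
        estimate = sum(d[target] for d in nearest) / k
        filled.append({**row, target: estimate})
    return filled


# Hypothetical rows: height (cm), weight (kg), upper leg length (cm).
rows = [
    {"ht": 150, "wt": 50, "leg": 34.0},
    {"ht": 160, "wt": 60, "leg": 36.0},
    {"ht": 170, "wt": 70, "leg": 38.0},
    {"ht": 180, "wt": 80, "leg": 40.0},
    {"ht": 190, "wt": 90, "leg": 42.0},
    {"ht": 200, "wt": 100, "leg": 44.0},
    {"ht": 165, "wt": 65, "leg": None},
]
completed = knn_impute(rows, target="leg", features=["ht", "wt"], k=5)
print(completed[-1]["leg"])
```

In practice the features would usually be standardized first, so that variables on larger scales do not dominate the distance.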

Data cleaning - Removal of outliers

Biochemical variables have more outliers than the other variables.

Variable  Description                      Units   Levels / Range
SCr       Serum creatinine                 mg/dL   0.14 - 15.66

The normal range is 0.6 - 1.2 mg/dL; values of 5 mg/dL or more indicate severe kidney impairment. We removed all values above 5 mg/dL (17 values in total). Source: https://www.medicinenet.com/creatinine_blood_test/article.htm
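The cutoff-based filtering above can be sketched as follows in plain Python. The sketch assumes that whole observations are dropped when SCr exceeds the cutoff (the slides do not say whether rows were dropped or values set to NA) and that SCr has no missing values; the function name and example values are hypothetical.

```python
SCR_CUTOFF = 5.0  # mg/dL, the threshold stated on the slide


def remove_high_scr(rows, cutoff=SCR_CUTOFF):
    """Keep rows whose serum creatinine is at or below the cutoff.

    Returns the kept rows and the number of rows removed.
    """
    kept = [r for r in rows if r["SCr"] <= cutoff]
    return kept, len(rows) - len(kept)


# Hypothetical serum creatinine values in mg/dL.
rows = [{"SCr": 0.9}, {"SCr": 1.1}, {"SCr": 5.0}, {"SCr": 7.3}, {"SCr": 15.66}]
kept, n_removed = remove_high_scr(rows)
print(n_removed, [r["SCr"] for r in kept])
```

Reporting the number of removed rows alongside the filtered data makes it easy to check the result against the expected count.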

Results & Discussion

Exploratory data analysis

Linear correlation between numeric variables



Positive correlations occur primarily between body-size-related variables.
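For reference, the linear (Pearson) correlation between two numeric variables can be computed as in this minimal plain-Python sketch. The example values for `wt` and `waist` are hypothetical and chosen to be perfectly linear; they are not taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical weight (kg) and waist circumference (cm) values.
wt = [50, 60, 70, 80]
waist = [70, 80, 90, 100]
r = pearson(wt, waist)
print(round(r, 3))
```

A correlation matrix is this coefficient evaluated for every pair of numeric variables.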

Diagnosis status across BMI class

  • Excess weight and increasing obesity levels seem to be contributing factors in the development of diabetes.
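A comparison of diagnosis status across BMI classes could be set up as in the sketch below. The slides do not state which class boundaries were used; this plain-Python illustration assumes the standard WHO adult cutoffs (18.5, 25, 30), and the records are hypothetical.

```python
from collections import defaultdict


def bmi_class(bmi):
    """Assign a BMI class using WHO adult cutoffs (an assumption here)."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def diagnosis_rate_by_class(records):
    """Return the fraction of diagnosed individuals (dx == 1) per BMI class."""
    counts = defaultdict(lambda: [0, 0])  # class -> [diagnosed, total]
    for bmi, dx in records:
        entry = counts[bmi_class(bmi)]
        entry[0] += dx
        entry[1] += 1
    return {cls: diagnosed / total for cls, (diagnosed, total) in counts.items()}


# Hypothetical (bmi, dx) pairs.
records = [(17.0, 0), (22.0, 0), (23.5, 0), (27.0, 0),
           (28.0, 1), (33.0, 1), (36.0, 0), (41.0, 1)]
rates = diagnosis_rate_by_class(records)
print(rates)
```

Since the dataset includes ages from 12 years, adult cutoffs would only be a rough classification for the youngest participants.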

Age as a contributing factor to diagnosis across BMI class

  • Older individuals are diagnosed more often than younger individuals.
  • Higher obesity levels seem to be associated with diagnosis at a younger age.

Treatment status of different ethnicity and age

  • Older individuals receive treatment more often than younger individuals.
  • No apparent association between treatment status and ethnicity.

Influence of income and ethnicity on treatment status

Annual income level and ethnicity do not seem to influence treatment status.

Serum albumin levels in relation to diagnosis

Serum albumin is lower in diagnosed than in non-diagnosed individuals.

Principal Component Analysis

Investigation of patterns concerning diagnosis of diabetes mellitus

Variables dx, tx, leg, arml, wt and ht were excluded

Investigation of patterns in relation to BMI

Variables bmi, wt and ht were excluded

K-means clustering

Identify relevant number of clusters

Clusters between age and all other variables

Single Parameter Logistic Regression performance

Impact Estimation with confidence interval
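If the impact estimates are logistic regression coefficients, each one can be expressed as an odds ratio with a Wald confidence interval, as in this plain-Python sketch. That reading of the slide title is an assumption, and the coefficient and standard error below are hypothetical numbers, not results from the project.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Convert a logistic regression coefficient to an odds ratio with a Wald CI.

    beta: estimated coefficient (log-odds per unit of the predictor).
    se:   standard error of beta.
    z:    normal quantile; 1.96 gives an approximate 95% interval.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient and standard error for a single predictor.
odds_ratio, lower, upper = odds_ratio_ci(beta=0.05, se=0.01)
print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))
```

An interval that excludes 1 suggests that the predictor is associated with the outcome at the chosen confidence level.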

Logistic Regression

Conclusion

  • Diagnosis of DM correlates with age, blood glucose, BMI, …
  • Income and race do not appear to predict DM diagnosis or treatment status.
  • Blood glucose outweighs the other variables in predicting DM diagnosis.
  • Patients cannot be clustered based on these variables alone.

Further research: Older people with diabetes appear to be treated more often than younger people with diabetes.